Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF/BUG: reimplement MultiIndex.remove_unused_levels #16565

Merged

Conversation

rhendric
Copy link
Contributor

@rhendric rhendric commented Jun 1, 2017

  • Add a large random test case for remove_unused_levels that failed the
    previous implementation

  • Fix PERF: remove_unused_levels is very slow #16556, a performance issue with the previous implementation

  • Add inplace functionality

  • Always return (if not inplace) at least a view instead of the original
    index

@rhendric rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch from 4231e0c to dbd9ddc Compare June 1, 2017 02:18
@codecov
Copy link

codecov bot commented Jun 1, 2017

Codecov Report

Merging #16565 into master will decrease coverage by 50.59%.
The diff coverage is 6.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16565      +/-   ##
==========================================
- Coverage   90.75%   40.16%   -50.6%     
==========================================
  Files         161      161              
  Lines       51073    51078       +5     
==========================================
- Hits        46350    20513   -25837     
- Misses       4723    30565   +25842
Flag Coverage Δ
#multiple ?
#single 40.16% <6.25%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 32.27% <6.25%> (-64.38%) ⬇️
pandas/types/concat.py 0% <0%> (-100%) ⬇️
pandas/tools/hashing.py 0% <0%> (-100%) ⬇️
pandas/parser.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tslib.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-90.67%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.28%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
pandas/io/json/normalize.py 8.16% <0%> (-88.78%) ⬇️
... and 111 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee8346d...dbd9ddc. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 1, 2017

Codecov Report

Merging #16565 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16565      +/-   ##
==========================================
- Coverage   90.75%   90.75%   -0.01%     
==========================================
  Files         161      161              
  Lines       51095    51097       +2     
==========================================
+ Hits        46370    46371       +1     
- Misses       4725     4726       +1
Flag Coverage Δ
#multiple 88.59% <100%> (-0.01%) ⬇️
#single 40.16% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/multi.py 96.65% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.23% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db419bf...83bdc59. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. pls add a 0.20.2 bug fix indexing (you can also say perf improvement there), whatsnew note. we r releasing tomorrow, so do today if you can.

@@ -1263,6 +1263,11 @@ def remove_unused_levels(self):

.. versionadded:: 0.20.0

Parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indexes are immutable. We never use inplace. remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um, are you sure? Index.set_names, MultiIndex.set_levels, and MultiIndex.set_labels already use inplace. And unlike the latter two functions, remove_unused_levels can't change the result of values, meaning it should be very safe to use in place since it won't change the semantics of any data structures using the index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes i am sure
those are terrible patterns for an immutable object and should be removed

# nothing changed
if not changed.any():
return self
if inplace:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use shallow_copy here

assert result2.is_(result)

def test_remove_unused_levels_large(self):
# because tests should be deterministic:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment


def test_remove_unused_levels_large(self):
# because tests should be deterministic:
rng = np.random.RandomState(4) # chosen by fair dice roll.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use random state here

assert len(result.levels[1]) < len(df.index.levels[1])
assert result.equals(df.index)

# in place
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

result = df.index.remove_unused_levels()
assert len(result.levels[0]) < len(df.index.levels[0])
assert len(result.levels[1]) < len(df.index.levels[1])
assert result.equals(df.index)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also tests with different things for the levels, e.g. add another example with dates & strings

assert len(result.levels[0]) < len(df.index.levels[0])
assert len(result.levels[1]) < len(df.index.levels[1])
assert result.equals(df.index)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also construct that this IS equal to .reset_index(..).set_index(...).index. (for each example)

@jreback jreback added Bug MultiIndex Performance Memory or execution speed performance labels Jun 1, 2017
@jreback jreback added this to the 0.20.2 milestone Jun 1, 2017
@rhendric rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch 2 times, most recently from 8a9fe43 to 62c5907 Compare June 1, 2017 15:17
@rhendric
Copy link
Contributor Author

rhendric commented Jun 1, 2017

@jreback, I removed RandomState from the tests as you asked, but the test now occasionally fails for me when I run pytest because of the non-determinism. (The levels are no longer guaranteed to be strictly smaller in the output.) Is this actually better? I'd rather not change the inequality checks to be <= because I want to assert that remove_unused_levels is doing something. Which is best: a weaker test, a test that has a small random chance of failing, or a test that uses RandomState?

(Oh look, it just failed like this in CI.)

assert result.equals(df.index)

expected = df.reset_index().set_index(['first', 'second']).index
assert result.equals(expected)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I should replace this with the stronger tm.assert_index_equal(). I plan to do so when updating this test based on what you tell me to do about the non-determinism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert result.equals(df.index)

expected = df.reset_index().set_index(['first', 'second']).index
assert result.equals(expected)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Copy link
Contributor

jreback commented Jun 1, 2017

@jreback, I removed RandomState from the tests as you asked, but the test now occasionally fails for me when I run pytest because of the non-determinism. (The levels are no longer guaranteed to be strictly smaller in the output.) Is this actually better? I'd rather not change the inequality checks to be <= because I want to assert that remove_unused_levels is doing something. Which is best: a weaker test, a test that has a small random chance of failing, or a test that uses RandomState?

oh that's fine. wasn't really sure why you were explicity setting the random. yes sometime you will not remove levels based on the filter, so ok putting it back.

@rhendric rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch from 62c5907 to 9852c6f Compare June 1, 2017 17:37
@jreback
Copy link
Contributor

jreback commented Jun 1, 2017

@rhendric lgtm. can you add an asv for this (asv/benchmarks/indexing.py) is ok. you can use your test example (just the int/int is fine). and pls post the result.

* Add a large random test case for remove_unused_levels that failed the
previous implementation

* Fix pandas-dev#16556, a performance issue with the previous implementation

* Always return at least a view instead of the original index
@rhendric
Copy link
Contributor Author

rhendric commented Jun 2, 2017

I'm new to asv; what is the result you want me to post? Console output? A file somewhere?

@rhendric rhendric force-pushed the fix/MultiIndex.remove_unused_levels branch from 9852c6f to 83bdc59 Compare June 2, 2017 00:03
@jreback
Copy link
Contributor

jreback commented Jun 2, 2017

implement a suitable asv. then you run asv continuous master yourbranch -b regexttoselectit

it should run and then post results. see http://pandas-docs.github.io/pandas-docs-travis/contributing.html#running-the-performance-test-suite

@@ -248,6 +254,9 @@ def time_multiindex_small_get_loc_warm(self):
def time_is_monotonic(self):
self.miint.is_monotonic

def time_remove_unused_levels(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep!

@rhendric
Copy link
Contributor Author

rhendric commented Jun 2, 2017

       before           after         ratio
     [fb47ee56]       [83bdc590]
-         185±2ms          846±2μs     0.00  indexing.MultiIndexing.time_remove_unused_levels

@jreback
Copy link
Contributor

jreback commented Jun 2, 2017

ok that's slightly better 😄

@jreback jreback changed the title BUG: reimplement MultiIndex.remove_unused_levels PERF/BUG: reimplement MultiIndex.remove_unused_levels Jun 2, 2017
@jreback jreback merged commit 8d092d9 into pandas-dev:master Jun 2, 2017
@jreback
Copy link
Contributor

jreback commented Jun 2, 2017

thanks!

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jun 2, 2017
TomAugspurger pushed a commit that referenced this pull request Jun 4, 2017
Kiv pushed a commit to Kiv/pandas that referenced this pull request Jun 11, 2017
stangirala pushed a commit to stangirala/pandas that referenced this pull request Jun 11, 2017
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Jul 12, 2017
Version 0.20.2

* tag 'v0.20.2': (68 commits)
  RLS: v0.20.2
  DOC: Update release.rst
  DOC: Whatsnew fixups (pandas-dev#16596)
  ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460)
  BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444)
  PERF: vectorize _interp_limit (pandas-dev#16592)
  DOC: whatsnew 0.20.2 edits (pandas-dev#16587)
  API: Make is_strictly_monotonic_* private (pandas-dev#16576)
  BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565)
  Strictly monotonic (pandas-dev#16555)
  ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026)
  fix linting
  BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244)
  BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317)
  return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486)
  BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549)
  BUG: Fixed tput output on windows (pandas-dev#16496)
  Strictly monotonic (pandas-dev#16555)
  BUG: fixed wrong order of ordered labels in pd.cut()
  BUG: Fixed to_html ignoring index_names parameter
  ...
@rhendric rhendric deleted the fix/MultiIndex.remove_unused_levels branch November 24, 2022 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants